Coursera Capstone Project - The Battle of Neighbourhoods

Analysis of London residential housing market

January 2021 - Fabrizio Boffa

Introduction

London is a fascinating city with many faces and, as the background of this project, we will assume that a group of stakeholders has already decided to invest in its real estate market.

Given that, the objective of this project is to find newly built residential properties that, on one hand, match their investment profile and, on the other, have the potential to increase in value. We will accomplish this by defining a rating system based on each property's characteristics, on the relevant trends in the area where it is located, and on its surrounding venues.

First, we will look for available datasets from which we can gather insights about future housing market trends in various areas of London. We will then pick the most promising areas and look for residential units on sale within them by browsing specialised housing websites. Finally, we will inspect each property's surroundings using the Foursquare API to see what kind of venues are nearby and further characterise each property.

By combining all this information about local housing market trends, property features and surroundings, this document proposes and applies a methodology for approaching the residential housing market so that an investor can finally make an informed buying decision.

Even though this document is intended for a group of stakeholders interested in the London housing market for a real estate investment, it should also interest the small investor who wants to settle in London, buying a new home with certain characteristics and surroundings and a good chance of increasing in value in the future.

Data

We will acquire data from different sources on three main subjects:

  • Local Areas
  • Residential properties currently on sale
  • Surrounding venues

As local areas, we will use the geographical subdivision of London into boroughs, and the main source of data will be the London Datastore (https://data.london.gov.uk/). It was created by the Greater London Authority (GLA) as a first step towards freeing London's data, and it will provide us with a large amount of historical data and projections, grouped by borough or ward, on a wide range of topics.
Looking at the London data available at borough level, we will pick the datasets that are likely to have an impact on housing market trends, namely:

  • Housing market historical trends
  • Demographic historical trend and predictions
  • Income historical trend
  • Future residential development
  • Crimes historical trend
  • Air pollution historical trends and predictions

As said, all the information has been acquired from the London Datastore, apart from the City of London crime data, which comes from the UKCrimeStats website (www.ukcrimestats.com), an open data platform of the Economic Policy Centre (www.economicpolicycentre.com). In the methodology section of this document we will further describe every dataset analysed; most of them are in time series form. The common goal will be to produce a rating system that synthesises the impact of each local feature on the future housing market value in the same area.

Afterwards, we will acquire data about residential properties currently on sale by analysing one of the most important UK property websites: www.rightmove.co.uk.
This source will provide the following information about residential properties on sale in the London area:

  • Type of residential unit
  • Location coordinates
  • Number of bedrooms
  • Size
  • Price

These data will give us the ability to estimate further property characteristics like "spaciousness" and "affordability" to better match investor preferences.

Finally, with the location coordinates acquired in the previous step, we will use the Foursquare API to acquire nearby venues. We will subdivide venues into these categories:

  • Outdoors & Entertainment
  • Services
  • Food and Nightlife
  • Transports

and count the number of venues nearby in each category to characterize the surroundings of each residential unit.
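
As an illustrative sketch of this counting step (not the notebook's exact code), assuming the Foursquare category names of nearby venues have already been retrieved, and using a small, hypothetical mapping from Foursquare categories to our four macro-categories:

```python
from collections import Counter

# Hypothetical mapping from Foursquare category names to our macro-categories.
CATEGORY_MAP = {
    "Park": "Outdoors & Entertainment",
    "Theater": "Outdoors & Entertainment",
    "Bank": "Services",
    "Pharmacy": "Services",
    "Café": "Food and Nightlife",
    "Pub": "Food and Nightlife",
    "Metro Station": "Transports",
    "Bus Stop": "Transports",
}

MACRO_CATEGORIES = ("Outdoors & Entertainment", "Services",
                    "Food and Nightlife", "Transports")

def count_venues_by_macro_category(venue_categories):
    """Count nearby venues (list of Foursquare category names) per macro-category."""
    counts = Counter(CATEGORY_MAP[v] for v in venue_categories if v in CATEGORY_MAP)
    # Ensure every macro-category appears, even with zero venues.
    return {c: counts.get(c, 0) for c in MACRO_CATEGORIES}

# Toy list of venues near one hypothetical property.
nearby = ["Park", "Café", "Pub", "Metro Station", "Pharmacy", "Café"]
profile = count_venues_by_macro_category(nearby)
```

The four resulting counts become the venue features of the property.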

At this point, we will have all the information to completely rate each single property. We will then define a final rating system that we could match with an investor profile to obtain a personalized recommended buying list.

Methodology

The methodology is divided into two phases.

In the first phase we will assess each borough's quality in relation to a potential residential investment.
We will analyse the collected data to produce predictions for the near future. Based on these predictions and on the impact of each feature on housing market value, we will define, for each borough, a rating on each subject analysed.
Predictions about the near future are fundamental for our goal. We have to consider that the current value of each feature is already correlated with the current housing market value, so by itself it is not sufficient to give us the information we want, namely the likelihood of a property value increase. Indeed, the current value defines both the current selling and buying prices, so it tells us nothing about what prices will be in the future. In other words, if we invest today we want to find the borough where there is a high probability that our investment will increase in value.

The mentioned rating will reflect the impact of the selected study topics on the housing market, and for each one of them every borough will be graded from zero (worst) to one (best). A good rating will mean that the feature is likely to contribute to a positive increment of housing market value in that borough.

The rating itself will be computed from the relative increment between the future value $v_f$ and the current value $v_c$ of the given feature:

$\frac{v_f - v_c}{v_c}$

We will assess the future values one year ahead of December 2020.
With all the increments calculated for a given feature, we will obtain the final ratings by applying a min-max normalisation to the zero-one range, so that for every feature the best borough is rated one and the worst zero.

Almost all the datasets we will analyse in this phase are time series, and some of them also contain predictive data. Where a future value is missing, we will extrapolate one using a polynomial regression model and, since we want to capture just the overall trend, we will keep the polynomial degree low.
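
The whole rating pipeline for one feature can be sketched as follows (an illustrative sketch with toy data, not the notebook's exact code; the degree and year encoding are assumptions):

```python
import numpy as np

def feature_rating(series_by_borough, years, future_year, degree=2):
    """Rate boroughs on the predicted relative increment of one feature.

    For each borough: fit a low-degree polynomial to the historical series,
    extrapolate it to `future_year`, compute (v_f - v_c) / v_c, then
    min-max scale the increments to [0, 1] across boroughs.
    """
    increments = {}
    for borough, values in series_by_borough.items():
        coeffs = np.polyfit(years, values, degree)  # low degree: overall trend only
        v_f = np.polyval(coeffs, future_year)       # extrapolated future value
        v_c = values[-1]                            # current (latest observed) value
        increments[borough] = (v_f - v_c) / v_c
    lo, hi = min(increments.values()), max(increments.values())
    return {b: (inc - lo) / (hi - lo) for b, inc in increments.items()}

# Toy example with three made-up boroughs; years are offsets from the first
# observation, which keeps the polynomial fit well conditioned.
years = np.arange(5)
ratings = feature_rating(
    {"Rising":  np.array([100., 110., 120., 130., 140.]),
     "Flat":    np.array([100., 100., 100., 100., 100.]),
     "Falling": np.array([200., 195., 190., 185., 180.])},
    years, future_year=5)
```

By construction the borough with the largest predicted increment is rated one and the one with the smallest is rated zero.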

All the ratings will be stored in a dataframe and we will perform clustering analysis using K-means and DBSCAN algorithms to obtain some clusters with interesting characteristics with the purpose of selecting some promising boroughs that we will use for the second phase.

In the second phase we will use a web scraping algorithm to acquire data about new residential properties on sale in the most promising boroughs, and run several iterations of the Foursquare "explore" API to find the surrounding venues of each property.
Based on the data gathered, we will run several clustering algorithms to extract the most interesting groups of properties for our goal.

First phase - Boroughs ratings

Collecting boroughs data

We start with the construction of a dataframe containing basic information about the 33 London boroughs:

  • Borough Name (as index)
  • Authority Code
  • Area in square kilometres
  • Population density estimate for the current year

We will use the authority codes as a basis to make borough naming uniform among the datasets, and the other columns for the evaluation of the various grades.
From this dataframe we also derive an empty dataframe that we will fill with borough ratings in the next steps.
Below, the reader finds the aforementioned boroughs table.

Out[4]:
Code Km2 Pop/Km2
Borough
Barking and Dagenham E09000002 36.1 6047.6
Barnet E09000003 86.7 4693.4
Bexley E09000004 60.6 4208.8
Brent E09000005 43.2 7953.6
Bromley E09000006 150.1 2240.6
Camden E09000007 21.8 11812.4
City of London E09000001 2.9 2770.7
Croydon E09000008 86.5 4627.2
Ealing E09000009 55.5 6514.3
Enfield E09000010 80.8 4243.3
Greenwich E09000011 47.3 6178.8
Hackney E09000012 19.0 15197.0
Hammersmith and Fulham E09000013 16.4 11631.1
Haringey E09000014 29.6 9761.1
Harrow E09000015 50.5 5179.0
Havering E09000016 112.3 2330.5
Hillingdon E09000017 115.7 2733.4
Hounslow E09000018 56.0 5079.8
Islington E09000019 14.9 16344.7
Kensington and Chelsea E09000020 12.1 13262.4
Kingston upon Thames E09000021 37.3 4904.0
Lambeth E09000022 26.8 12693.9
Lewisham E09000023 35.1 9031.9
Merton E09000024 37.6 5652.1
Newham E09000025 36.2 10043.3
Redbridge E09000026 56.4 5534.7
Richmond upon Thames E09000027 57.4 3513.7
Southwark E09000028 28.9 11443.8
Sutton E09000029 43.8 4819.0
Tower Hamlets E09000030 19.8 16583.5
Waltham Forest E09000031 38.8 7473.3
Wandsworth E09000032 34.3 9710.8
Westminster E09000033 21.5 12099.5
Housing market trend

'Likely, if it went well in the past it will go well also in the immediate future'

Analysing the historical housing market series is certainly important to understand the possible future trend.
The importance of this dataset lies in the fact that all the other data we will collect about boroughs have only a partial impact on future housing market trends. In other words, there are surely other influential components that we cannot assess, simply because they are not recorded and so are not measurable. However, reading the historical housing market trends will give us information about how those hidden components acted in the past. Consequently, when we predict the future based on these data, we will somehow take them into account.

For this task we will use data from 1995 to 2020 available on the London Datastore website; the main source of this dataset is the UK HM Land Registry. From this dataset we will acquire, for each borough, the historical average house sale prices, and we will extrapolate future values using a polynomial regression model. Specifically, we will fit two regression models for each borough, one using short-term data and one using long-term data. That will give us two regression curves representing the short- and long-term housing market trends. We will then evaluate both curves at the future time to define a predicted value, and finally calculate the relative increase from the current value to the future value.

Finally, we will assign two ratings by scaling all the relative increases foreseen by the short-term model, so that the best performing borough is graded one and the worst zero. Of course, we will do the same for the long-term model.
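
The two-window prediction can be sketched like this (an illustrative sketch: the window size, polynomial degree and toy series are assumptions, not the notebook's exact choices):

```python
import numpy as np

def short_long_term_increments(years, prices, future_year,
                               short_window=5, degree=2):
    """Predicted relative price increases from short- and long-term trends.

    Fits one polynomial on the last `short_window` observations and one on
    the whole series, evaluates both at `future_year`, and returns the two
    relative increases over the current (latest) price.
    """
    current = prices[-1]
    increments = []
    for y, p in ((years[-short_window:], prices[-short_window:]),
                 (years, prices)):
        coeffs = np.polyfit(y, p, degree)
        predicted = np.polyval(coeffs, future_year)
        increments.append((predicted - current) / current)
    return tuple(increments)  # (short-term, long-term)

# On a perfectly linear toy series both trends predict the same increase.
years = np.arange(11)            # offsets from the first observed year
prices = 100.0 + 10.0 * years    # toy average prices, thousands of GBP
short, long_ = short_long_term_increments(years, prices, future_year=11)
```

On real data the two windows generally disagree, which is exactly the information the two ratings capture.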

This phase concludes with the recording of the ratings so obtained in the ratings dataframe.

In the following plots, we can see the observations over the years and the second-degree regression curves that best fit the data. Visually, the more the right end of a curve points up, the better the rating. The plots are interactive and two different boroughs can be selected to compare performances; by default, the best and the worst performers by rating are presented.

Demographics

'The more the buyers, the higher the prices'

A population increase is a direct cause of an increase in demand for residential properties and, ultimately, of an increase in housing market value. That said, it is important for our purposes to examine demographic data. We will do this using datasets from the London Datastore which include, for each London borough, Greater London Authority demographic estimates (2016-based projections), the 2011 Census and mid-year estimates by the UK Office for National Statistics. Luckily for us, this dataset also contains predictions up to 2050, so we will not need to use predictive models in this phase. To grade each borough, we will simply evaluate the relative increase of population density from the current date to the future date, and we will perform the same scaling done in the previous phase. Finally, we will store the data in the ratings dataframe.

In the following plots, we can see the observations and predictions over the years; the slope of the line connecting the last two observations represents the relative increment. Visually, the more the right end of the line points up, the better the rating. The plots are interactive and two different boroughs can be selected to compare performances; by default, the best and the worst performers by rating are presented.

Income and earnings

'The more the money, the higher the prices'

In this paragraph we analyse trends in the personal earnings of workers in the different London boroughs. We assume that people tend to prefer living as near as possible to their workplace, so we will analyse a dataset that reports median income by workplace rather than by residence.

The dataset is provided by the UK Office for National Statistics (ONS) and contains information about the earnings of employees who work in an area, are on adult rates and whose pay for the survey pay-period was not affected by absence.

As done previously, we will evaluate past trends with a polynomial regression model and predict possible short-term increases, which we will then translate into borough ratings.

In the following plots, we can see the observations over the years and the second-degree regression curve that best fits the data. Visually, the more the right end of the curve points up, the better the rating. The plots are interactive and two different boroughs can be selected to compare performances; by default, the best and the worst performers by rating are presented.

Future residential development

'The more the sellers, the lower the prices'

An increase in housing availability in a specific area is a negative factor because it will expand supply and consequently shrink the housing market value in that area. Therefore, we need to know how many residential properties will become available in the future and, luckily for us, through the London Datastore we have access to the housing approvals recorded in the London Development Database (LDD).

The LDD contains details of all planning consents meeting criteria agreed with the London boroughs, which are responsible for submitting data to the database. Only planning consents are recorded in the database; for details of applications being considered by a local planning authority (borough), or for refusals, we would have to visit each relevant planning authority's website. For the sake of simplicity, we will assume that the refusal rate is a constant percentage across all boroughs, and thus will not affect our computations.

To rate each borough in this category we will assess how many permissions have been completed in the last 24 months. Considering building phase timings, this will give us a forecast of how many new residential units will enter the market in the near future. We also have to take into account that a given number of new residential properties affects each borough differently depending on its demographics. Thus, we will divide the above value by the borough population to obtain a parameter we can use to compare boroughs with one another.
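
This rating can be sketched as follows (an illustrative sketch with toy figures; note the min-max scaling is inverted, since more upcoming supply means a worse rating):

```python
def development_rating(completions_24m, population):
    """Rate boroughs on upcoming housing supply (more supply -> worse rating).

    Computes new units per 1,000 inhabitants, then inverts a min-max scaling
    so the borough with the least upcoming supply is rated 1 and the one
    with the most is rated 0.
    """
    per_1000 = {b: 1000.0 * completions_24m[b] / population[b]
                for b in completions_24m}
    lo, hi = min(per_1000.values()), max(per_1000.values())
    return {b: 1.0 - (v - lo) / (hi - lo) for b, v in per_1000.items()}

# Toy figures for three hypothetical boroughs of equal population.
scores = development_rating(
    {"A": 100, "B": 400, "C": 250},
    {"A": 100_000, "B": 100_000, "C": 100_000})
```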

The bar plot below shows, for each borough, the number of new residential units per thousand inhabitants that are about to enter the housing market in the next years. The rating itself is also noted above each bar and represented by its colour. Visually, the higher the bar, the worse the rating.

Crimes

'The more the criminality, the lower the prices'

It is evident that crime level and land value are inversely proportional. To acquire crime data we will use the historical series provided by the London Metropolitan Police Service through the London Datastore website. We will merge two time series, one for the last 24 months and a second one for the less recent observations, starting from April 2010.
This dataset does not contain information about the City of London borough, because the City of London Police, not the Metropolitan Police Service, is responsible for the safety of everyone in the 'Square Mile'.
Since we want a complete dataset, we will acquire this information from the first table in the webpage www.ukcrimestats.com/Police_Force/City_of_London_Police and merge the data. The final dataset contains crime observations from December 2010 onwards for every London borough.

In the following plots, we can see the observations over the years and the second-degree regression curve that best fits the data. Visually, the more the right end of the curve points down, the better the rating. The plots are interactive and two different boroughs can be selected to compare performances; by default, the best and the worst performers by rating are presented.

Air pollution

'The more the smog, the lower the prices'

We will use datasets from a study commissioned from King's College London by Transport for London and the Greater London Authority. You can read the full report here: https://www.london.gov.uk/what-we-do/environment/pollution-and-air-quality/modelling-long-term-health-impacts-air-pollution-london.
This study used a computer simulation to estimate the long-term health impacts from 2016 to 2050 of the Ultra Low Emission Zone (ULEZ) and the wider suite of policies included in the London Environment Strategy (LES). Specifically, this study estimates the health impacts of the change in concentration of two pollutants: Nitrogen Dioxide (NO2) and Particulate Matter (PM2.5). These pollutants are known to have long-term health effects.
For each borough we will take the sum of the incidence of both NO2- and PM2.5-related diseases for each year from 2016 to 2021, and the rating will be evaluated considering the relative increase of incidence from the 2020 value to the 2021 value. Naturally, the lower the increase, the better the rating.

In the following plots, we can see the observations and predictions over the years; the slope of the line connecting the last two observations represents the relative increment. Visually, the more the right end of the line points down, the better the rating. The plots are interactive and two different boroughs can be selected to compare performances; by default, the best and the worst performers by rating are presented.

Boroughs ratings table

Now that all the data about boroughs have been collected and analysed and borough performances have been measured, let's have a look at our final ratings dataframe, displayed below.

In the next paragraph, we will try to use machine learning clustering algorithms to see if we can obtain some clusters populated with boroughs that are good candidates for further in-depth study.

Out[17]:
Housing short-term Housing long-term Demographic Area income Land development Crime Air pollution
Borough
Barking and Dagenham 0.05 0.43 0.42 0.21 0.98 0.26 1.00
Barnet 1.00 1.00 0.24 0.53 0.79 0.27 0.33
Bexley 0.43 0.53 0.11 0.76 0.96 0.14 0.67
Brent 0.83 0.85 0.12 0.89 0.44 0.30 0.28
Bromley 0.35 0.56 0.19 0.56 0.95 0.45 0.89
Camden 0.52 0.66 0.13 0.65 0.78 0.49 0.24
City of London 0.89 0.30 0.44 0.77 0.00 0.00 0.63
Croydon 0.67 0.83 0.14 1.00 0.58 0.24 0.51
Ealing 0.49 0.55 0.74 0.52 0.79 0.19 0.41
Enfield 0.61 0.72 0.26 0.69 0.96 0.10 0.35
Greenwich 0.79 0.75 0.15 0.62 0.73 0.23 0.26
Hackney 0.80 0.65 0.18 0.53 0.86 0.53 0.71
Hammersmith and Fulham 0.27 0.01 1.00 0.69 0.85 0.59 0.21
Haringey 0.68 0.64 0.16 0.47 0.99 0.46 0.30
Harrow 0.31 0.59 0.15 0.97 0.58 0.06 0.23
Havering 0.92 0.87 0.48 0.53 0.86 0.58 0.91
Hillingdon 0.55 0.72 0.24 0.58 0.84 0.51 0.27
Hounslow 0.96 0.59 0.19 0.86 0.82 0.29 0.51
Islington 0.60 0.35 0.07 0.77 0.96 1.00 0.13
Kensington and Chelsea 0.66 0.00 0.00 0.63 1.00 0.66 0.27
Kingston upon Thames 0.65 0.56 0.26 0.80 0.80 0.26 0.47
Lambeth 0.55 0.55 0.04 0.68 0.80 0.99 0.30
Lewisham 0.63 0.77 0.22 0.57 0.91 0.09 0.43
Merton 0.62 0.64 0.22 0.67 0.98 0.33 0.39
Newham 0.85 0.91 0.20 0.41 0.78 0.45 0.08
Redbridge 0.95 0.76 0.36 0.00 0.96 0.14 0.89
Richmond upon Thames 0.75 0.43 0.14 0.59 0.99 0.40 0.64
Southwark 0.33 0.47 0.11 0.85 0.85 0.47 0.41
Sutton 0.61 0.70 0.22 0.92 0.88 0.15 0.78
Tower Hamlets 0.28 0.52 0.27 0.47 0.06 0.45 0.29
Waltham Forest 0.61 0.94 0.21 0.76 0.78 0.28 0.81
Wandsworth 0.49 0.37 0.46 0.63 0.44 0.61 0.17
Westminster 0.00 0.18 0.19 0.85 0.80 0.17 0.00

Clustering boroughs

We will use the K-means and DBSCAN algorithms for this task and, since they are both based on the concept of Euclidean distance, we will apply a system of weights to the features that shrinks the variation range of those we assume less important for our scenario. Shrinking a feature's range means that the samples' "diversity" (distance) along that particular feature is reduced, so it is unlikely to produce new clusters. Samples that differ in reduced-range features will most likely end up spread across clusters created by samples that differ in full-range features.

That said, according to our personal judgement about how the discussed topics could affect the housing market in the upcoming years, here are the weights we will use:

  • Housing market short term trend: 100%
  • Housing market long term trend: 75%
  • Demographic trend: 100%
  • Income trend: 50%
  • New residential units: 50%
  • Crime trend: 100%
  • Air pollution trend: 25%
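
Applying these weights is a simple column-wise multiplication of the ratings dataframe (a minimal sketch; the column names are taken from the ratings table, and the two toy boroughs are hypothetical):

```python
import pandas as pd

# The weights above, as fractions of the full [0, 1] rating range.
WEIGHTS = pd.Series({
    "Housing short-term": 1.00, "Housing long-term": 0.75,
    "Demographic": 1.00, "Area income": 0.50,
    "Land development": 0.50, "Crime": 1.00, "Air pollution": 0.25,
})

def weight_ratings(ratings: pd.DataFrame) -> pd.DataFrame:
    """Shrink the range of the less important features, so they contribute
    less to the Euclidean distances used by K-means and DBSCAN."""
    return ratings * WEIGHTS  # column-wise multiplication, aligned by label

# Two hypothetical boroughs with extreme ratings.
ratings = pd.DataFrame([[1.0] * 7, [0.0] * 7],
                       index=["Best", "Worst"], columns=WEIGHTS.index)
weighted = weight_ratings(ratings)
```

After weighting, the air pollution rating can differ between two boroughs by at most 0.25, while the crime rating can still differ by the full 1.0.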

First, let's have a look at K-means clustering. This algorithm aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centroid), so as to minimise the within-cluster sum of squares (variance). Since the number of clusters has to be given as a starting parameter of the algorithm, a trial-and-error approach has to be taken to assess the optimal number of clusters to consider.

To do so, we will run the algorithm with various starting numbers of clusters and random centroid positions, and we will measure its performance using two different metrics: silhouette and inertia.
The silhouette score tells how far the datapoints in one cluster are from the datapoints in another cluster. It ranges from negative one (worst) to positive one (best). Values near zero indicate overlapping clusters, while negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
Inertia tells how far apart the points within a cluster are. Its value starts from zero (best) and goes up.
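
The sweep over cluster counts and random states can be sketched like this (an illustrative sketch using scikit-learn; the ranges and the toy blob data are assumptions, not the notebook's exact settings):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_sweep(X, k_values=range(2, 9), random_states=range(20)):
    """Try every (k, random_state) pair; for each k keep the run with the
    best silhouette score, along with its inertia and random state."""
    results = {}
    for k in k_values:
        best = None
        for rs in random_states:
            km = KMeans(n_clusters=k, random_state=rs, n_init=10).fit(X)
            score = silhouette_score(X, km.labels_)
            if best is None or score > best[0]:
                best = (score, km.inertia_, rs)
        results[k] = best  # (best silhouette, its inertia, its random state)
    return results

# Toy data: three well-separated blobs, so k = 3 should win on silhouette.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in (0.0, 5.0, 10.0)])
sweep = kmeans_sweep(X, k_values=range(2, 6), random_states=range(3))
best_k = max(sweep, key=lambda k: sweep[k][0])
```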

In the picture below we can see the trends of inertia and silhouette score for various runs of the K-means algorithm at different numbers of clusters and random centroid starting positions. Only the random position that generates the best silhouette score is shown in the plot. Overall, 140 runs have been evaluated and we can identify two interesting points, at three and four clusters. Both have a silhouette score slightly above 0.3. We think the point at four clusters is a bit more interesting, even if its silhouette score is slightly smaller: there the inertia is lower, and its plot also shows a small elbow. The elbow point in the inertia graph is interesting because beyond it the change in the value of inertia becomes less significant.

Analysis report
Combinations of N° of clusters and N° of random states evaluated: 140
Best silhouette score: 0.309
Best silhouette score N° of clusters: 3
Best silhouette score random state: 1

In the following picture we further analyse the K-means output with the number of clusters set to four. Looking at the ratings distribution in each cluster, we immediately see that there is no group that performs well in all categories.
Focusing on the more relevant features according to the weights defined previously, let's point out some clues:

  • Cluster N°0 contains the boroughs with the best predictions about future criminality levels. They also perform above average in the short-term housing market trend. It's likely that some interesting boroughs are in this cluster.
  • Cluster N°1 is characterised by the best performing boroughs in demographic trend and a slightly above average criminality trend. Housing market trends are slightly below average, but we could find some interesting boroughs in this cluster too.
  • Cluster N°2, the most populated, shows that the majority of boroughs have above average housing market trends. Unfortunately, they also show below average crime and demographic trends, with some exceptions. Since it is the most populated cluster, we can surely find some interesting boroughs in it.
  • Cluster N°3 contains boroughs that perform below average in every relevant feature, apart from the long-term housing market trend, which is average. It's unlikely that interesting boroughs are in this cluster.

Finally, let's have a look at the K-means clusters composition in the graph below.

Density-based spatial clustering of applications with noise (DBSCAN) is a non-parametric, density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed, i.e. whose distance is less than "epsilon" and whose number is greater than or equal to "min samples". It starts with an arbitrary point that has not been visited; this point's neighbourhood is retrieved and, if it contains sufficiently many points, a cluster is started. Otherwise, the point is labelled as noise.

We will iterate through several values of epsilon and min samples until we find a satisfying output. To monitor the quality of each iteration we will keep track of three parameters:

  • Number of clusters formed
  • Noise, that is, the number of unclustered samples at the end of the process
  • Largest cluster size

Increasing epsilon reduces the noise, because more and more isolated points are reached by the growing clusters, and increases the size of the largest cluster, since the clusters formed tend to merge with one another. If the data are effectively clusterable, we should find a trade-off value where the noise is low but the largest cluster is not yet too big.
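
This parameter sweep can be sketched as follows (an illustrative sketch using scikit-learn; the tiny 2-D toy dataset stands in for the weighted borough ratings):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_sweep(X, eps_values, min_samples_values):
    """For each (eps, min_samples) pair record the number of clusters,
    the noise count and the largest cluster's size."""
    report = {}
    for eps in eps_values:
        for ms in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=ms).fit_predict(X)
            clusters = set(labels) - {-1}          # -1 marks noise
            noise = int(np.sum(labels == -1))
            largest = max((int(np.sum(labels == c)) for c in clusters),
                          default=0)
            report[(eps, ms)] = (len(clusters), noise, largest)
    return report

# Toy data: two tight pairs plus one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 10.0]])
report = dbscan_sweep(X, eps_values=[0.2, 20.0], min_samples_values=[2])
```

With a small epsilon the isolated point stays noise; with a very large one everything merges into a single cluster, which is exactly the trade-off described above.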

The graph below shows the tendency of the aforementioned parameters at different values of epsilon and minimum sample size. We can observe how the data tend to form a single big cluster surrounded by noisy samples. We think an interesting point is at epsilon equal to 0.41 and min samples equal to 2, with 3 clusters formed and noise equal to 6.

Analysis report
Combinations of epsilon and minimum samples evaluated: 140

In the following picture we further analyse the DBSCAN algorithm output with the aforementioned parameters.

Focusing on the more relevant features we can see:

  • Cluster N°0, the most populated, shows that the majority of boroughs have above average housing market trends. Unfortunately, they also show below average crime and demographic trends, with some exceptions. We expect its composition to be similar to K-means cluster N°2.
  • Cluster N°1 has two best performers in crime trend and average performers in housing market trend. Unfortunately, these boroughs are also the worst performers in demographic trend. We expect its composition to be similar to K-means cluster N°0.
  • Cluster N°2 also contains two boroughs, which perform below average in all relevant categories except crime trend. It has no comparable K-means cluster.

Analysis report
Total number of clustered samples: 27
Number of unclustered samples: 6

Again, let's have a look at the DBSCAN clusters composition in the graph below.

Borough "Residential Investment Score"

So far we have understood that each borough has some pros and cons, and the same holds for the clusters formed by the two algorithms used. To pick some interesting boroughs to investigate further in the residential properties evaluation stage, we need to summarise the various ratings for each of them so that we end up with an overall measure.
We will call this measure the "Residential Investment Score" (RIS) and it will be the weighted sum of all the previous ratings, using the weight system already defined. The score itself will be scaled with a min-max normalisation to the zero-one range, one being the best and zero the worst.
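
The RIS computation can be sketched like this (an illustrative sketch; the three boroughs are hypothetical and the weights are the same illustrative ones used for clustering):

```python
import pandas as pd

# The same illustrative weights used for clustering.
WEIGHTS = pd.Series({
    "Housing short-term": 1.00, "Housing long-term": 0.75,
    "Demographic": 1.00, "Area income": 0.50,
    "Land development": 0.50, "Crime": 1.00, "Air pollution": 0.25,
})

def residential_investment_score(ratings: pd.DataFrame) -> pd.Series:
    """Weighted sum of all the feature ratings, min-max scaled to [0, 1]."""
    raw = (ratings * WEIGHTS).sum(axis=1)
    return (raw - raw.min()) / (raw.max() - raw.min())

# Three hypothetical boroughs: all-ones, all-zeros and a middling one.
ratings = pd.DataFrame(
    [[1.0] * 7, [0.0] * 7, [0.5] * 7],
    index=["Best", "Worst", "Middle"], columns=WEIGHTS.index)
ris = residential_investment_score(ratings)
```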

With this new parameter it's now straightforward to pick the right boroughs. Indeed, in the bar plot below, we can see the RIS calculated for every borough on the y axis. For visual reference, the higher and greener the bar, the better the score.

As we can see we have an uncontested winner by a margin of more than 20%.

Focusing on the best performers with an above average RIS (0.6), we can see that they are included in all K-means clusters apart from N°3, as predicted, and in DBSCAN clusters N°0 and N°1.

Final boroughs ratings summary

Before going deep into the residential properties analysis phase, let's take an overall look at what we have found about borough features and their relations to future housing market value.

In the bar plots below we can see the ratings achieved by each borough, the mean rating value and the Residential Investment Score (RIS). These plots are presented in an interactive form and the reader can choose which boroughs to display for a detailed one-to-one comparison. Initially, the best and the worst RIS performers are shown by default.

In the choropleth map below we can further visualise how each borough scores in the various categories, which can be chosen interactively with the legend radio buttons. Hovering the mouse over a borough displays all its ratings.

Inspecting this map, we can discover how geography influences the various performances in some way.
We can see:

  • how the outer boroughs perform better than the central ones in the housing market trend features.
  • how the best demographic increase is predicted in a few boroughs, namely "Hammersmith and Fulham" and "Ealing", with all the rest in the below-average area.
  • how a concentrated drop in income is predicted in "Redbridge" and "Barking and Dagenham", with all the rest in the above-average area.
  • how residential land development (per inhabitant) is focused on the central boroughs of "Tower Hamlets" and "City of London".
  • how the crime trend improvement is better in a few central boroughs, while all the rest are in the below-average area.
  • how the air pollution trend is better in the south-eastern boroughs than in the north-western ones.

Finally, the RIS tells us that, with some exceptions, outer boroughs seem more promising than inner boroughs for a residential investment.

Out[26]:

Now that we have an overall measure of borough performance in terms of residential investment quality, we can pick those we will focus on in the next stage.
We decided to consider only boroughs with a RIS above 0.6. In the graph below we have a final summary view of those.

Second phase - Residential properties selection and evaluation

Acquisition of on sales residential properties data

For this task we will use a web scraping algorithm applied to the Rightmove website. We will pick residential properties on sale in the London area with these characteristics:

  • New home, recently built
  • Selling price between 50,000 and 5,000,000 GBP
  • Number of bedrooms less than or equal to 5
  • Land and park home property types excluded
  • Buying schemes excluded

Data acquired are:

  • Address
  • Type of residential unit
  • Number of bedrooms
  • Size
  • Price
  • Location coordinates

We first filter the data, removing rows containing null values and duplicates. Then, size is converted to square metres and four new columns are computed:

  • Thousands of GBP per square metre, which gives us information on the "affordability" of the property
  • Square metres per bedroom, which gives us information on the "spaciousness" of the property
  • Borough, to transfer onto the residential properties all the information acquired in the previous phase
  • Straight-line distance in kilometres from the London "centre", which we assume to be the location coordinates of Trafalgar Square.

Finally, a check on the size values is performed, removing unlikely square-meters-per-bedroom values, and properties outside the interesting boroughs are removed.
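The derived columns can be sketched as follows. This is a minimal illustration on two made-up rows using the column names of the report's table; Trafalgar Square is taken at approximately (51.508, -0.128), so the computed distances are an assumption of this sketch.

```python
import math

import pandas as pd

TRAFALGAR = (51.508, -0.128)  # approximate Trafalgar Square coordinates

def haversine_km(lat1, lon1, lat2, lon2):
    """Straight-line (great-circle) distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

# Two illustrative rows; Price is in millions of GBP, Size in square meters
df = pd.DataFrame({
    "Bedrooms": [2, 3],
    "Size": [77.4, 100.0],
    "Price": [0.695, 0.990],
    "Latitude": [51.530851, 51.530851],
    "Longitude": [-0.082688, -0.082688],
})

df["Size/bedroom"] = (df["Size"] / df["Bedrooms"]).round(1)
df["Price/sqmt"] = (df["Price"] * 1000 / df["Size"]).round(2)  # thousands GBP/sqm
df["Distance"] = df.apply(
    lambda r: round(haversine_km(r["Latitude"], r["Longitude"], *TRAFALGAR), 1),
    axis=1,
)
```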

Below the reader finds a brief report of the aforementioned process and the first five records of the dataframe.

Properties data acquisition report
Number of json files (Rightmove search pages) read: 174
Total number of properties acquired: 4243
Total number of properties without duplicates and "null" values: 1395
Total number of properties without unlikely size values and not interesting types: 1376
Total number of properties inside interesting boroughs: 504

Out[28]:
Address Type Bedrooms Size Price Latitude Longitude Size/bedroom Price/sqmt Borough Distance
Id
51117279 The Residence Hoxton, London, N1 Apartment 2 77.4 0.695 51.530851 -0.082688 38.7 8.98 Hackney 4.1
51117612 The Residence Hoxton, London, N1 Apartment 3 100.0 0.990 51.530851 -0.082688 33.3 9.90 Hackney 4.1
51117663 The Residence Hoxton, London, N1 Apartment 2 92.2 0.810 51.530851 -0.082688 46.1 8.79 Hackney 4.1
53080209 EC2A 2FA Apartment 2 71.5 1.386 51.524200 -0.083677 35.8 19.38 Hackney 3.5
54230967 Pentonville Rd, Pentonville, N1 Flat 4 162.2 2.990 51.531734 -0.113260 40.6 18.43 Islington 3.1

As we can see from the brief report, we ended up with a sample of 1376 "valid" properties out of a total of 4243 sale adverts acquired from Rightmove.
This reduction is due to the fact that many sale adverts do not provide size values in the proper field, so this information would have to be acquired in other ways. For example, many adverts place the size data directly in the description of the property, so we would need to define a text search algorithm to extract it. Considering the scope of this document, we will stay with the above sample and ignore properties with no size data recorded in the proper field.
Of course not all the properties found are located in the aforementioned interesting boroughs, so the final count is 504 properties usable for the next steps.

In the histograms below we can observe the distribution of boroughs and house types among the residential properties population. We can observe that:

  • more than half of the properties are located in the boroughs of Hackney and Lambeth
  • the vast majority of the sample consists of flats and apartments

We can now see the distributions of the other features. The width of each bar of the histograms below represents a range of values whose limits are indicated on the x axis. Each range includes its starting value and excludes its ending value. The height of the darker green bar represents the percentage of properties with a value in that range, and the height of the lighter green bar is the cumulative percentage. We can observe that:

  • almost half of the sample consists of properties with two bedrooms
  • the vast majority of the properties have a size between 50 and 100 square meters
  • over half of the properties have a price between 500.000 and 1.000.000 GBP
  • the majority of the properties have a size per bedroom in the range between 30 and 40 square meters
  • almost 70% of the properties have a price per square meter in the range between 5.000 and 15.000 GBP
  • over half of the sample is within a distance of less than 6 kilometres from the London centre

Acquisition of residential properties surroundings data

Now, we will acquire data about each property's surroundings thanks to the Foursquare developer API.
Foursquare is the most trusted, independent location data platform for understanding how people move through the real world. It combines the rich attributes of over 105 million global points of interest with the understanding of human movement from over 500 million devices.
The developer API allows the user to retrieve, via code, the surrounding venues given a location's coordinates and a search radius. It also allows refining queries so that the output contains only venues of a given category. We will feed the API with the properties' location coordinates, set the search radius to 400 meters (5 minutes walking) and an output limit of 100 venues for each category and location. Finally, for every property, we will run a search iteration for each of these categories:

  • "Entertainment", corresponding to the merging of "Arts & Entertainment" and "Outdoors & Recreation" Foursquare categories
  • "Service", corresponding to the merging of "College & University", "Professional & Other Places", and "Shop & Service" Foursquare categories
  • "Food & Night", corresponding to the merging of "Food" and "Nightlife Spot" Foursquare categories
  • "Transport", corresponding to the "Travel & Transport" Foursquare category
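A sketch of the query step is shown below. The credentials are placeholders, and the top-level category IDs are taken from Foursquare's v2 category tree as an assumption; they should be verified against the current Foursquare documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CLIENT_ID, CLIENT_SECRET = "YOUR_ID", "YOUR_SECRET"  # placeholder credentials
VERSION = "20210101"          # API version date
RADIUS, LIMIT = 400, 100      # 5-minute walk, max venues per query

# Merged categories -> assumed Foursquare v2 top-level category IDs
CATEGORIES = {
    "Entertainment": ["4d4b7104d754a06370d81259",   # Arts & Entertainment
                      "4d4b7105d754a06377d81259"],  # Outdoors & Recreation
    "Service": ["4d4b7105d754a06372d81259",         # College & University
                "4d4b7105d754a06375d81259",         # Professional & Other Places
                "4d4b7105d754a06378d81259"],        # Shop & Service
    "Food & Night": ["4d4b7105d754a06374d81259",    # Food
                     "4d4b7105d754a06376d81259"],   # Nightlife Spot
    "Transport": ["4d4b7105d754a06379d81259"],      # Travel & Transport
}

def build_url(lat, lon, category_ids):
    """URL of one venues/search call around a property's coordinates."""
    params = {
        "client_id": CLIENT_ID, "client_secret": CLIENT_SECRET, "v": VERSION,
        "ll": f"{lat},{lon}", "radius": RADIUS, "limit": LIMIT,
        "categoryId": ",".join(category_ids),
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

def count_venues(lat, lon, category_ids):
    """Number of matching venues within RADIUS metres (performs a network call)."""
    with urlopen(build_url(lat, lon, category_ids)) as resp:
        return len(json.load(resp)["response"]["venues"])
```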

In the histograms below we can observe the distribution of surrounding venue categories among the properties. Each graph shows, in darker green, the percentage of properties with a given number of surrounding venues. Each venue count was broken down into an appropriate number of bins, each containing 5 counting units. In lighter green, we can see the corresponding cumulative histograms.

We can observe:

  • more than half of the properties have 5 or fewer surrounding entertainment venues, and only 15% have more than 15
  • over four fifths of the properties have 6 or more service venues, and over half have 11 or more
  • only slightly less than a fourth of the properties have 5 or fewer "Food & Night" venues, while almost half have more than 15
  • similarly, only a third of the properties have 5 or fewer "Transport" venues nearby

Residential properties and venues inspection with interactive map

In the map below we can see, with different marker colours, the properties and surrounding venues collected in the previous stages.
London boroughs are highlighted in light green, with the interesting ones in darker green.
Properties are represented by black markers, while venues are represented in the following colours: magenta for "Entertainment" venues, blue for "Service" venues, red for "Food & Night" venues and orange for "Transport" venues.

The map can be scaled and panned interactively, and marker information can be obtained by hovering the mouse over them.

Out[33]:

Rating residential properties

To progress in our journey through the London residential housing market, we will create a properties "rating" dataframe, similarly to what we did for boroughs. For this task we will use these features:

  • "Type", a numerical conversion of the house types (the higher the value, the more prestigious the type)
  • "Spaciousness", the already defined Size/Bedroom (the higher the better)
  • "Affordability", the opposite of Price/sqmt (the lower the price per square meter, the better the affordability)
  • "Position", the opposite of the distance from the London centre (the smaller the distance, the better)
  • "Entertainments", "Services", "Food & Night" and "Transports", the numbers of corresponding venues within five minutes walking range (the higher the better)
  • "Borough RIS", the residential investment score of the borough the property belongs to

All the features will be min-max normalized to the zero-one range; below are the first five rows of this new dataframe.
The RIS score, which was between 0.6 and 1.0 due to the previous borough selection, has also been normalized again to create more "distance" for the clustering algorithms we will use in the next stages of the analysis. As clarified in the previous paragraphs, this is because we want to maximize the weight of this important feature.

Out[34]:
Type Spaciousness Affordability Position Entertainments Services Food & Night Transports Borough RIS
Id
51117279 0.0 0.193598 0.853385 0.845833 0.230769 0.26 0.36 0.100000 0.208092
51117612 0.0 0.111280 0.823869 0.845833 0.230769 0.26 0.36 0.100000 0.208092
51117663 0.0 0.306402 0.859480 0.845833 0.230769 0.26 0.36 0.100000 0.208092
53080209 0.0 0.149390 0.519731 0.870833 0.923077 0.52 1.00 0.414286 0.208092
54230967 0.0 0.222561 0.550209 0.887500 0.153846 0.16 0.21 0.214286 0.161850
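The min-max normalization step can be sketched as follows, on a tiny illustrative sample; "Affordability" is obtained by negating Price/sqmt before scaling, so that cheaper properties get the higher score.

```python
import pandas as pd

def min_max_normalize(df):
    """Rescale every column to the [0, 1] range (min-max normalization)."""
    return (df - df.min()) / (df.max() - df.min())

# Tiny illustrative sample (values taken from the properties table above)
ratings = pd.DataFrame({
    "Spaciousness": [38.7, 33.3, 46.1],
    "Price/sqmt": [8.98, 9.90, 8.79],
})
# Affordability is the opposite of the price per square meter
ratings["Affordability"] = -ratings.pop("Price/sqmt")
normalized = min_max_normalize(ratings)
```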

Clustering residential properties

As we did before with the boroughs, we will use the K-means and DBSCAN algorithms on the properties ratings dataframe to see if we can obtain some clusters populated with interesting properties, and we will again apply some weights to the features before fitting the models. Ideally these weights represent a particular interest in a feature we want to privilege. Apart from the fact that it is objectively important to favour the RIS score, since it synthesizes the probability that the selected properties' value will increase in the future, the weights of all the other features are subjective and depend on the investor profile.

Let's focus, for example, on an investor that wants a very affordable solution in a good central position with sufficient services and some entertainment around, all while maximizing the quality of the investment. We could profile such an investor with the following weights:

  • Type: 25%
  • Spaciousness: 50%
  • Affordability: 100%
  • Position: 75%
  • Entertainments: 25%
  • Services: 50%
  • Food & Night: 25%
  • Transports: 50%
  • Borough RIS: 100%

Moving on with the clustering algorithms, as shown in the plots below, we find an optimal number of K-means clusters of four, with a relatively small elbow and a silhouette score of 0.334.

Analysis report
Combinations of N° of clusters and N° of random states evaluated: 140
Best silhouette score: 0.334
Best silhouette score N° of clusters: 4
Best silhouette score random state: 0
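The grid search over cluster counts and random states can be sketched as below. The data here is a random stand-in for the weighted ratings dataframe, so the printed score will differ from the report's 0.334; the weights vector mirrors the hypothetical investor profile listed above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical investor-profile weights, mirroring the list above
WEIGHTS = np.array([0.25, 0.50, 1.00, 0.75, 0.25, 0.50, 0.25, 0.50, 1.00])

rng = np.random.default_rng(0)
X = rng.random((200, 9))   # random stand-in for the normalized ratings
Xw = X * WEIGHTS           # apply the feature weights before fitting

best_score, best_k, best_state = -1.0, None, None
for k in range(2, 9):          # 7 cluster counts ...
    for state in range(20):    # ... times 20 random states = 140 combinations
        labels = KMeans(n_clusters=k, random_state=state, n_init=10).fit_predict(Xw)
        score = silhouette_score(Xw, labels)
        if score > best_score:
            best_score, best_k, best_state = score, k, state

print(f"Best silhouette score: {best_score:.3f} "
      f"({best_k} clusters, random state {best_state})")
```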

In the following picture we further analyse the K-means algorithm output with the number of clusters set to 4. Let's point out some clues we can see:

  • Cluster N°0 is characterized by properties that are near the London centre and have an abundance of surrounding venues. As we might expect, they are also the least affordable properties, even though there are some exceptions that could be very interesting.
  • Cluster N°1 contains more affordable properties that are still in a central position. This is the second best cluster regarding venue abundance.
  • Cluster N°2 is somewhat similar to N°1. Here the properties are less central and count fewer venues around, but they are also more affordable.
  • Cluster N°3 contains all the properties located in Havering. This is evident from the Borough RIS feature, which is at its maximum in this borough. They are also among the most affordable but also the most isolated.

Let's now try to form some clusters with the DBSCAN algorithm. As before, we will iterate through different values of epsilon and min samples, and the graphs below show the trends of the number of clusters formed, the noisy samples and the largest cluster sizes. We can see an interesting spot at epsilon equal to 0.16 and min samples equal to 20. Indeed, looking at the general trend of the plots, we can observe how a further increase of epsilon tends to rapidly create a single large cluster, while a decrease increases the noise level.

Analysis report
Combinations of epsilon and minimum samples values evaluated: 204

In the following picture we further analyse the DBSCAN algorithm output with the aforementioned values of epsilon and min samples. Let's see what clues we can find:

  • Cluster N°0 is by far the most populated one. Its composition is characterized by properties with above-average central position and affordability but a lack of nearby venues.
  • Cluster N°1 contains properties in a very central position with above-average venue counts. Despite this, their affordability remains around the average, so this cluster seems very interesting.
  • Cluster N°2 groups the most affordable properties. We can also see fewer surrounding venues and an average position.
  • Cluster N°3 is similar to N°2. The difference is that the properties are located in the worst boroughs by RIS score (among those previously selected with RIS above 0.6). These properties are also more expensive than those grouped in cluster N°2 and in a more central position.
  • Cluster N°4 is similar to N°3. We can observe a more central position and, correspondingly, a bigger presence of surrounding venues, at the price of a slight decrease in affordability.

Analysis report
Total number of clustered samples: 451
Number of unclustered samples: 53
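The epsilon / min-samples sweep can be sketched in the same spirit, again on random stand-in data, so the counts will not reproduce the 204 combinations or the 451/53 split of the report; the grid values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
Xw = rng.random((200, 9)) * 0.5   # random stand-in for the weighted ratings

results = []
for eps in np.linspace(0.05, 0.50, 16):        # hypothetical epsilon grid
    for min_samples in (5, 10, 15, 20, 25):    # hypothetical min-samples grid
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Xw)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int((labels == -1).sum())     # unclustered (noisy) samples
        results.append((eps, min_samples, n_clusters, n_noise))
# Each tuple gives one point of the "clusters formed / noisy samples" plots
```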

Results

After the clustering process we obtained some groups that seem to contain residential properties matching our investment preferences (read: weights): in particular K-means cluster N°0 and, more precisely, DBSCAN cluster N°1. Even if they are characterized by an average affordability level, which is our main concern according to the proposed weights, they are also notable for the outstanding central position of the properties they contain and for the abundance of surrounding venues.

To define a final ranking we will proceed in the same way proposed for the definition of the RIS score. So, we will define a new property feature that synthesizes all the others, obtained as their weighted sum. For this task we will use the original borough RIS scores and not the rescaled ones used for the properties clustering process.
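The Overall Score step can be sketched as a weighted sum of the normalized features, itself rescaled to [0, 1]. The dataframe below is a tiny hypothetical sample, not the report's data, and the weights mirror the investor profile listed earlier.

```python
import pandas as pd

# Investor-profile weights from the list above
WEIGHTS = {
    "Type": 0.25, "Spaciousness": 0.50, "Affordability": 1.00, "Position": 0.75,
    "Entertainments": 0.25, "Services": 0.50, "Food & Night": 0.25,
    "Transports": 0.50, "Borough RIS": 1.00,
}

def overall_score(ratings):
    """Weighted sum of the normalized features, rescaled to [0, 1]."""
    score = sum(ratings[col] * w for col, w in WEIGHTS.items())
    return ((score - score.min()) / (score.max() - score.min())).round(2)

# Tiny hypothetical sample of already-normalized feature ratings
ratings = pd.DataFrame(
    {col: [0.9, 0.5, 0.1] for col in WEIGHTS}, index=[101, 102, 103]
)
ranking = overall_score(ratings).sort_values(ascending=False)
```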

In the table below we can see the final ranking of the first 25 properties, ordered by descending overall score; as we might expect, they all belong to the aforementioned clusters.

In conclusion we can say that, for the kind of investor represented by the weights previously defined, the uncontested winner of The Battle of Neighbourhoods is the borough of Lambeth, even if it is not the most promising in terms of investment quality. Indeed, while still being a good choice from a pure real estate investment point of view, it prevails thanks to its outstanding central position and the abundance of all kinds of venues within a 5-minute walking range, all at a quite affordable price considering London centre rates.

Out[39]:
Borough Type RIS Spaciousness Affordability Position Entertainments Services Food & Night Transports Overall Score K-mean Cluster DBSCAN Cluster
Id
73540722 Lambeth Apartment 0.67 0.54 0.63 0.99 0.96 0.42 0.73 0.94 1.00 0 1
74253939 Lambeth Apartment 0.67 0.52 0.62 0.99 1.00 0.37 0.67 0.93 0.97 0 1
74253936 Lambeth Apartment 0.67 0.52 0.62 0.99 1.00 0.37 0.67 0.93 0.97 0 1
76403126 Lambeth Apartment 0.67 0.54 0.65 0.99 0.88 0.38 0.59 0.91 0.96 0 1
74253945 Lambeth Apartment 0.67 0.52 0.62 0.99 1.00 0.37 0.67 0.93 0.96 0 1
94206326 Lambeth Apartment 0.67 0.29 0.68 0.99 0.73 0.44 0.81 1.00 0.95 0 1
98334512 Lambeth Flat 0.67 0.38 0.57 0.99 0.96 0.46 0.77 0.94 0.94 0 1
73897482 Lambeth Apartment 0.67 0.38 0.61 1.00 0.92 0.37 0.71 0.96 0.92 0 1
75437676 Lambeth Apartment 0.67 0.52 0.52 0.99 1.00 0.37 0.67 0.93 0.91 0 1
75277581 Lambeth Apartment 0.67 0.38 0.53 0.99 0.77 0.43 0.80 0.94 0.88 0 1
75306222 Lambeth Apartment 0.67 0.30 0.57 0.99 0.77 0.43 0.80 0.94 0.88 0 1
57137964 Lambeth Apartment 0.67 0.38 0.53 0.99 0.88 0.38 0.59 0.91 0.85 0 1
87440440 Lambeth Apartment 0.67 0.44 0.39 0.99 0.96 0.44 0.75 0.94 0.84 0 1
75306210 Lambeth Apartment 0.67 0.15 0.55 0.99 0.77 0.43 0.80 0.94 0.83 0 1
75277593 Lambeth Apartment 0.67 0.19 0.50 0.99 0.77 0.43 0.80 0.94 0.81 0 1
54826797 Lambeth Apartment 0.67 0.38 0.56 1.00 0.65 0.32 0.65 0.87 0.81 0 1
79398886 Lambeth Apartment 0.67 0.41 0.38 0.99 0.77 0.44 0.77 0.97 0.81 0 1
99835166 Lambeth Apartment 0.67 0.34 0.41 1.00 0.96 0.37 0.71 0.96 0.81 0 1
87105763 Lambeth Apartment 0.67 0.23 0.41 0.99 0.96 0.44 0.75 0.94 0.80 0 1
99778481 Lambeth Apartment 0.67 0.38 0.38 1.00 0.96 0.37 0.71 0.96 0.80 0 1
83142793 Lambeth Apartment 0.67 0.37 0.37 0.99 0.77 0.44 0.77 0.97 0.80 0 1
92009732 Lambeth Apartment 0.67 0.41 0.36 1.00 0.96 0.37 0.71 0.96 0.80 0 1
79398829 Lambeth Apartment 0.67 0.39 0.35 0.99 0.77 0.44 0.77 0.97 0.79 0 1
68802561 Lambeth Apartment 0.67 0.34 0.56 1.00 0.58 0.33 0.66 0.84 0.79 0 1
92010191 Lambeth Apartment 0.67 0.38 0.35 1.00 0.96 0.37 0.71 0.96 0.78 0 1

Discussion

As we clearly understand, the final standing is heavily influenced by the weights proposed, first, to cluster the boroughs and define the RIS score and, second, to cluster the properties and define the overall property rating. So in this section we want to focus a bit more on these numbers.

The borough weights system is a product of "common sense" and it can be optimized. In other words, we can say that there is a particular weights vector that, applied to the borough features, optimizes the ability of the model to predict the future trend of the housing market.
Of course it is not an easy task to find that particular vector and, somehow, the selection proposed in this document resembles a mere bet. A method to optimize these weights could be to reproduce all the analysis concerning the boroughs at a past date, so that we can "measure" the final result against a known past value of the housing market trend.
For example, we could choose December 2018 as the current date, calculate all the borough ratings with predictions at one year, i.e. December 2019, apply the weights system and compare the model performance with the real housing market value at December 2019. Iterating through this process a sufficient number of times could eventually lead to a particular weights vector that optimizes the predictive performance of the model.

Of a totally different nature is the weights vector introduced to proceed through the residential properties selection. As stated previously, it synthesizes the investor profile and their subjective investment preferences. In other words, changing this weights vector moves the point of view of the analysis.

Given that, we find it particularly interesting to check the point of view of a pure investor whose only goal is to maximize the future returns. For this task we could set a weights vector like the one below:

  • Type: 0%
  • Spaciousness: 10%
  • Affordability: 100%
  • Position: 10%
  • Entertainments: 10%
  • Services: 20%
  • Food & Night: 10%
  • Transports: 20%
  • Borough RIS: 100%

Without going through the clustering processes, we will jump directly to calculating the Overall Score as we did before.
Looking at the results shown in the table below, we can see how, from this different perspective, the boroughs of Havering, first, and Barnet, second, become more interesting.

Out[40]:
Borough Type RIS Spaciousness Affordability Position Entertainments Services Food & Night Transports Overall Score
Id
97154141 Havering Duplex 1.00 0.09 0.99 0.18 0.00 0.03 0.00 0.03 1.00
97154852 Havering Apartment 1.00 0.07 0.99 0.18 0.00 0.03 0.00 0.03 1.00
101190266 Havering Detached 1.00 0.07 0.97 0.00 0.04 0.15 0.04 0.07 1.00
97160357 Havering Apartment 1.00 0.07 0.99 0.18 0.00 0.03 0.00 0.03 1.00
97164023 Havering Apartment 1.00 0.11 0.97 0.18 0.00 0.03 0.00 0.03 0.99
100736492 Havering Apartment 1.00 0.11 0.91 0.10 0.04 0.05 0.03 0.00 0.95
99047096 Havering Apartment 1.00 0.08 0.90 0.07 0.00 0.04 0.03 0.06 0.94
94158866 Barnet Semi-Detached 0.75 0.57 0.98 0.46 0.04 0.01 0.01 0.06 0.65
98200358 Barnet Apartment 0.75 0.21 0.96 0.44 0.00 0.07 0.13 0.06 0.62
62501793 Barnet Flat 0.75 0.36 0.93 0.66 0.00 0.11 0.04 0.07 0.62
67375443 Barnet Flat 0.75 0.36 0.91 0.66 0.00 0.11 0.04 0.07 0.62
64602930 Barnet Flat 0.75 0.36 0.92 0.66 0.00 0.11 0.04 0.07 0.62
94738283 Barnet Semi-Detached 0.75 0.43 0.96 0.45 0.04 0.05 0.01 0.00 0.62
73551661 Barnet Flat 0.75 0.19 0.94 0.62 0.08 0.03 0.04 0.07 0.62
94158869 Barnet Semi-Detached 0.75 0.45 0.96 0.46 0.04 0.01 0.01 0.06 0.62
99870122 Barnet Apartment 0.75 0.70 0.92 0.47 0.04 0.03 0.01 0.09 0.62
99334322 Barnet Apartment 0.75 0.13 0.95 0.48 0.00 0.11 0.03 0.01 0.61
99870134 Barnet Apartment 0.75 0.46 0.92 0.47 0.04 0.03 0.01 0.09 0.61
65784120 Barnet Town House 0.75 0.16 0.94 0.45 0.00 0.09 0.13 0.03 0.61
94206326 Lambeth Apartment 0.67 0.29 0.68 0.99 0.73 0.44 0.81 1.00 0.60
67375446 Barnet Flat 0.75 0.11 0.91 0.66 0.00 0.11 0.04 0.07 0.60
67229934 Barnet Apartment 0.75 0.16 0.87 0.63 0.00 0.19 0.24 0.09 0.60
62501271 Barnet Flat 0.75 0.32 0.89 0.66 0.00 0.11 0.04 0.07 0.60
74176698 Hackney Apartment 0.69 0.53 0.79 0.84 0.62 0.44 0.37 0.24 0.59
98698949 Barnet Apartment 0.75 0.21 0.92 0.46 0.00 0.05 0.04 0.06 0.59

Conclusion

Further analysis of the two previous tables leads to these final conclusions:

  • From the perspective of a buyer who wants to settle near the London centre without paying its highest rates, the best solutions are in the borough of Lambeth
  • From the perspective of a buyer who wants exclusively to maximize the potential of their investment, the best solutions are in the borough of Havering
  • The borough of Barnet represents a middle-way option in almost every category. It is surely a first choice if properties in Lambeth are considered too expensive.

We hope the reader enjoyed this journey through the London housing market. For further information, we reproduce below the links to the Rightmove website adverts corresponding to the aforementioned tables.

First table property URLs:

Second table property URLs: